OmniPage vs. Sakhr: paired model evaluation of two Arabic OCR products

Authors

  • Tapas Kanungo
  • Gregory A. Marton
  • Osama Bulbul
Abstract

Characterizing the performance of Optical Character Recognition (OCR) systems is crucial for monitoring technical progress, predicting OCR performance, providing scientific explanations for system behavior, and identifying open problems. While past research has compared the performance of two or more OCR systems, all of these studies assume that the accuracies achieved on individual documents in a dataset are independent when, in fact, they are not. In this paper we show that accuracies reported on any dataset are correlated, and we invoke the appropriate statistical technique, the paired model, to compare the accuracies of two recognition systems. We show theoretically that this method provides tighter confidence intervals than the methods used in the OCR and computer vision literature. We also propose a new visualization method, which we call the accuracy scatter plot, for providing a visual summary of performance results. This method summarizes the accuracy comparisons on the entire corpus while simultaneously allowing the researcher to visually compare the performance on individual document images. Finally, we report accuracy and speed as a function of scanning resolution. Contrary to what one might expect, the performance of one of the systems degrades when the image resolution is increased beyond 300 dpi. Furthermore, the average time taken to OCR a document image, after increasing almost linearly as a function of resolution, suddenly becomes constant beyond 400 dpi. This behavior is most likely because the OCR algorithm samples images at resolutions of 400 dpi and higher to a standard resolution. The two products that we compare are the Arabic OmniPage 2.0 and the Automatic Page Reader 3.01 from Sakhr. The SAIC Arabic dataset was used for the evaluations. The statistical and visualization methods presented in this article are very general and can be used for comparing the accuracies of any two recognition systems, not just OCR systems.
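To illustrate why pairing can tighten the interval, here is a minimal sketch that compares a paired interval (built from per-document accuracy differences) with an unpaired one (which treats the two accuracy lists as independent samples). It is not the paper's procedure: the function names, the normal-approximation critical value, and the six per-document accuracies are all assumptions made for illustration. The underlying identity is Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B), so the paired interval is narrower whenever the two systems' accuracies are positively correlated across documents.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev, variance


def paired_interval(acc_a, acc_b, level=0.95):
    """Interval for the mean accuracy difference using per-document differences."""
    if len(acc_a) != len(acc_b):
        raise ValueError("both systems must be scored on the same documents")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    # Normal approximation (an assumption); a t quantile suits small samples better.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * stdev(diffs) / sqrt(len(diffs))
    centre = mean(diffs)
    return centre - half, centre + half


def unpaired_interval(acc_a, acc_b, level=0.95):
    """Interval that (wrongly, for shared documents) treats the samples as independent."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * sqrt(variance(acc_a) / len(acc_a) + variance(acc_b) / len(acc_b))
    centre = mean(acc_a) - mean(acc_b)
    return centre - half, centre + half


# Hypothetical per-document accuracies: both systems score low on the same hard pages.
system_a = [0.95, 0.80, 0.62, 0.90, 0.71, 0.85]
system_b = [0.92, 0.76, 0.60, 0.85, 0.69, 0.80]

paired = paired_interval(system_a, system_b)
unpaired = unpaired_interval(system_a, system_b)
print("paired   width:", paired[1] - paired[0])
print("unpaired width:", unpaired[1] - unpaired[0])
```

On this made-up data both intervals are centred on the same mean difference, but the unpaired one is far wider because it absorbs the document-to-document variation that both systems share.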

Related resources

Performance Evaluation of Two Arabic OCR Products

Numerous Optical Character Recognition (OCR) companies claim that their products have near-perfect recognition accuracy (close to 99.9%). In practice, however, these accuracy rates are rarely achieved. Most systems break down when the input document images are highly degraded, such as scanned images of carbon-copy documents, documents printed on low-quality paper, and documents that are n-th ge...

Paired Model Evaluation of OCR

Characterizing the performance of Optical Character Recognition (OCR) systems is crucial for monitoring technical progress, predicting OCR performance, providing scientific explanations for system behavior and identifying open problems. While research has been done in the past to compare the performances of OCR systems, all methods assume that the accuracies achieved on individual documents in a...

Paired Model Evaluation of OCR Algorithms

Characterizing the performance of Optical Character Recognition (OCR) systems is crucial for monitoring technical progress, predicting OCR performance, providing scientific explanations for system behavior and identifying open problems. While research has been done in the past to compare the performances of OCR systems, all methods assume that the accuracies achieved on individual documents in a...

Persian/Arabic Baffletext CAPTCHA

Nowadays, many daily human activities, such as education, trade, and talks, are carried out over the Internet. For tasks such as registration on web sites, hackers write programs that make automatic false registrations, which waste the sites' resources and may even stop them from functioning. Therefore, human users should be distinguished from computer programs. To this end, this p...

Improved CHAID algorithm for document structure modelling

This paper proposes a technique for the logical labelling of document images. It makes use of a decision-tree based approach to learn and then recognise the logical elements of a page. A state-of-the-art OCR gives the physical features needed by the system. Each block of text is extracted during the layout analysis and raw physical features are collected and stored in the ALTO format. The data-...


Publication date: 1999